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Abstract 


Translation Memory (TM) systems are 
one of the most widely used translation 
technologies. An important part of TM 
systems is the matching algorithm that de¬ 
termines what translations get retrieved 
from the bank of available translations 
to assist the human translator. Although 
detailed accounts of the matching algo¬ 
rithms used in commercial systems can’t 
be found in the literature, it is widely 
believed that edit distance algorithms are 
used. This paper investigates and eval¬ 
uates the use of several matching algo¬ 
rithms, including the edit distance algo¬ 
rithm that is believed to be at the heart 
of most modern commercial TM systems. 
This paper presents results showing how 
well various matching algorithms corre¬ 
late with human judgments of helpfulness 
(collected via crowdsourcing with Ama¬ 
zon’s Mechanical Turk). A new algorithm 
based on weighted n-gram precision that 
can be adjusted for translator length pref¬ 
erences consistently returns translations 
judged to be most helpful by translators for 
multiple domains and language pairs. 


1 Introduction 


The most widely used computer-assisted transla¬ 
tion (CAT) tool for professional translation of spe¬ 
cialized text is translation memory (TM) technol¬ 


ogy (Christensen and Schjoldager, 2010). TM 


consists of a database of previously translated ma¬ 
terial, referred to as the TM vault or the TM bank 
(TMB in the rest of this paper). When a trans¬ 
lator is translating a new sentence, the TMB is 
consulted to see if a similar sentence has already 
been translated and if so, the most similar pre¬ 
vious translation is retrieved from the bank to 


help the translator. The main conceptions of TM 
technology occurred in the late 1970s and early 
1980s ( |Arthem, 1978[|Kay, 1980[|Melby and oth¬ 


ers, 1981). TM has been widely used since the 


late 1990s and continues to be widely used to 
day (Bowker and Barlow, 2008[ Christensen and 


Schjoldager, 2010[|Garcia, 2007[|Somers, 2003[ ). 

There are a lot of factors that determine how 
helpful TM technology will be in practice. Some 
of these include: quality of the interface, speed of 
the back-end database lookups, speed of network 
connectivity for distributed setups, and the com¬ 
fort of the translator with using the technology. 
A fundamentally important factor that determines 
how helpful TM technology will be in practice is 
how well the TM bank of previously translated 
materials matches up with the workload materials 
to be translated. It is necessary that there be a high 
level of match for the TM technology to be most 
helpful. However, having a high level of match is 
not sufficient. One also needs a successful method 
for retrieving the useful translations from the (po¬ 
tentially large) TM bank. 

TM similarity metrics are used for both evalu¬ 
ating the expected helpfulness of previous transla¬ 
tions for new workload translations and the met¬ 
rics also directly determine what translations get 
provided to the translator during translation of new 
materials. Thus, the algorithms that compute the 
TM similarity metrics are not only important, but 
they arc doubly important. 

The retrieval algorithm used by commercial TM 


systems is typically not disclosed (Koehn and 
Senellart, 2010} Simard and Fujita, 201 2| Why- 
man and Somers, 1999). However, the best¬ 


performing method used in current systems is 


widely believed to be based on edit distance (Bald- 

win and Tanaka, 2000, 

Simard and Fujita, 2012, 

Whyman and Somers, 1999; Koehn and Senellart, 

2010; Christensen and 

Schjoldager, 2010; Man- 

dreoli et ah, 2006 

He 

et al., 2010). Recently 
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Simard and Fujita (2012) have experimented with 
using MT (machine translation) evaluation metrics 
as TM fuzzy match, or similarity, algorithms. A 


limitation of the work of (Simard and Fujita, 2012 1 


was that the evaluation of the performance of the 
TM similarity algorithms was also conducted us¬ 
ing the same MT evaluation metrics. Simard 


and Fujita (2012) concluded that their evalua¬ 
tion of TM similarity functions was biased since 
whichever MT evaluation metric was used as the 
TM similarity function was also likely to obtain 
the best score under that evaluation metric. 

The current paper explores various TM fuzzy 
match algorithms ranging from simple baselines 
to the widely used edit distance to new methods. 
The evaluations of the TM fuzzy match algorithms 
use human judgments of helpfulness. An algo¬ 
rithm based on weighted n-gram precision consis¬ 
tently returns translations judged to be most help¬ 
ful by translators for multiple domains and lan¬ 
guage pairs. In addition to being able to retrieve 
useful translations from the TM bank, the fuzzy 
match scores ought to be indicative of how helpful 
a translation can be expected to be. Many transla¬ 
tors find it counter-productive to use TM when the 
best-matching translation from the TM is not simi¬ 
lar to the workload material to be translated. Thus, 
many commercial TM products offer translators 
the opportunity to set a fuzzy match score thresh¬ 
old so that only translations with scores above the 
threshold will ever be returned. It seems to be a 
widely used practice to set the threshold at 70% 
but again it remains something of a black-box as to 
why 70% ought to be the setting. The current pa¬ 
per uncovers what expectations of helpfulness can 
be given for different threshold settings for various 
fuzzy match algorithms. 

The rest of this paper is organized as follows. 
Section [2] presents the TM similarity metrics that 
will be explored; section [3] presents our experi¬ 
mental setup; section [4] presents and analyzes re¬ 
sults; and section[5]concludes. 

2 Translation Memory Similarity 
Metrics 


tence from the MTBT (Material To Be Translated) 
and C is the source language side of a candidate 
pre-existing translation from the TM bank. The 
metrics range from simple baselines to the sur¬ 
mised current industrial standard to new methods. 

2.1 Percent Match 

Perhaps the simplest metric one could conceive of 
being useful for TM similarity matching is percent 
match (PM), the percent of tokens in the MTBT 
segment found in the source language side of the 
candidate translation pair from the TM bank. 

Formally, 
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where M is the sentence from the MTBT that is 
to be translated, C is the source language side 
of the candidate translation from the TM bank, 
Munigrams is the set of unigrams in M, and 
Cunigrams is the set of unigrams in C. 

2.2 Weighted Percent Match 

A drawback of PM is that it weights the match¬ 
ing of each unigram in an MTBT segment equally, 
however, it is not the case that the value of assis¬ 
tance to the translator is equal for each unigram 
of the MTBT segment. The parts that are most 
valuable to the translator are the parts that he/she 
does not already know how to translate. Weighted 
percent match (WPM) uses inverse document fre¬ 
quency (IDF) as a proxy for trying to weight words 
based on how much value their translations are ex¬ 
pected to provide to translators. The use of IDF- 
based weighting is motivated by the assumption 
that common words that permeate throughout the 
language will be easy for translators to translate 
but words that occur in relatively rare situations 
will be harder to translate and thus more valuable 
to match in the TM bank. For our implementa¬ 
tion of WPM, each source language sentence in 
the parallel corpus we are experimenting with is 
treated as a “document” when computing IDF. 

Formally, 

WPM(M, C) = 


In this section we define the methods for measur¬ 
ing TM similarity for which experimental results 
are reported in section [4] All of the metrics com¬ 
pute scores between 0 and 1, with higher scores 
indicating better matches. All of the metrics take 
two inputs: M and C, where M is a workload sen¬ 
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sentences in the parallel corpus, and idf(x, D) = 
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2.3 Edit Distance 

A drawback of both the PM and WPM metrics 
are that they are only considering coverage of the 
words from the workload sentence in the candi¬ 
date sentence from the TM bank and not taking 
into account the context of the words. However, 
words can be translated very differently depending 
on then - context. Thus, a TM metric that matches 
sentences on more than just (weighted) percentage 
coverage of lexical items can be expected to per¬ 
form better for TM bank evaluation and retrieval. 
Indeed, as was discussed in section [T] it is widely 
believed that most TM similarity metrics used in 
existing systems are based on string edit distance. 


Our implementation of edit distance (Leven- 


shtein, 1966), computed on a word level, is sim¬ 


ilar to the version defined in (Koehn and Senellart, 


[20l0| . 

Formally, our TM metric based on Edit Dis¬ 
tance (ED) is defined as 


ED = max 


edit-dist(M, C) 
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where M, C, and M un i grams are as defined in 
Eq.[lJ and edit-dist(M , C) is the number of word 
deletions, insertions, and substitutions required to 
transform M into C. 

2.4 N-Gram Precision 

Although ED takes context into account, it does 
not emphasize local context in matching certain 
high-value words and phrases as much as metrics 
that capture n-gram precision between the MTBT 
workload sentence and candidate source-side sen¬ 
tences from the TMB. We note that n-gram preci¬ 
sion forms a fundamental subcomputation in the 
computation of the corpus-level MT evaluation 


metric BLEU score (Papineni et al., 2002). How¬ 
ever, although TM fuzzy matching metrics are re¬ 
lated to automated MT evaluation metrics, there 
are some important differences. Perhaps the most 
important is that TM fuzzy matching has to be able 
to operate at a sentence-to-sentence level whereas 
automated MT evaluation metrics such as BLEU 
score are intended to operate over a whole cor¬ 
pus. Accordingly, we make modifications to how 
we use n-gram precision for the puipose of TM 
matching than how we use it when we compute 


BLEU scores. The rest of this subsection and the 
next two subsections describe the innovations we 
make in adapting the notion of n-gram precision to 
the TM matching task. 

Our first metric along these lines, N-Gram Pre¬ 
cision (NGP), is defined formally as follows: 

N i 

NGP = £ j^Pn, (4) 
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where the value of N sets the upper bound on the 
length of n-grams considered and 


Pn = 


I M n -grams O Gn-gr arris 


Z * \Mj 


ii-grams 


+ (1 -Z)* I c, 


(5) 


n-grams \ 


where M and C are as defined in Eq. [TJ M n . grams 
is the set of n-grams in M, C n - gra ms is the set of 
n-grams in C, and Z is a user-set parameter that 
controls how the metric is normal izcd0 

As seen by equation [4] we use an arithmetic 
mean of precisions instead of the geometric mean 
that BLEU score uses. An arithmetic mean is bet¬ 
ter than a geometric mean for use in translation 
memory metrics since translation memory metrics 
are operating at a segment level and not at the 
aggregate level of an entire test set. At the ex¬ 
treme, the geometric mean will be zero if any of 
the n-gram precisions p n are zero. Since large n- 
gram matches are unlikely on a segment level, us¬ 
ing a geometric mean can be a poor method to use 
for matching on a segment level, as has been de¬ 
scribed for the related task of MT evaluation (|Dod-| 


dington, 2002, Lavie et al., 2004). Additionally, 


for the related task of MT evaluation at a segment 
level, Lavie et al. (2004) have found that using 
an arithmetic mean correlates better with human 
judgments than using a geometric mean. 

Now we turn to discussing the parameter Z for 
controlling how the metric is normalized. At one 
extreme, setting Z=1 will correspond to having no 
penalty on the length of the candidate retrieved 
from the TMB and leads to getting longer trans¬ 
lation matches retrieved. At the other extreme, 


1 We used N = 4 in our experiments. 

2 Note that the n in n-grams is intended to be substituted 
with the corresponding integer. Accordingly, forpi, n = 1 
and therefore M n - grarns = M\- grarns is the set of unigrams 
in M and C n - g rams = Ci- gra ms is the set of unigrams in C; 
for p 2 , n = 2 and therefore M n -grams = M 2 -grams is the 
set of bigrams in M and C„- gr ams = C2- gr ams is the set of 
bigrams in C; and so on. 


















setting Z=0 will correspond to a normalization 
that penalizes relatively more for length of the 
retrieved candidate and leads to shorter transla¬ 
tion matches being retrieved. There is a preci¬ 
sion/recall tradeoff in that one wants to retrieve 
candidates from the TMB that have high recall 
in the sense of matching what is in the MTBT 
sentence yet one also wants the retrieved candi¬ 
dates from the TMB to have high precision in the 
sense of not having extraneous material not rele¬ 
vant to helping with the translation of the MTBT 
sentence. The optimal setting of Z may differ 
for different scenarios based on factors like the 
languages, the corpora, and translator preference. 
We believe that for most TM applications there 
will usually be an asymmetric valuation of pre¬ 
cision/recall in that recall will be more important 
since the value of getting a match will be more 
than the cost of extra material up to a point. There¬ 
fore, we believe a Z setting in between 0.5 and 1.0 
will be an optimal default. We use Z=0.75 in all 
of our experiments described in section [3] and re¬ 
ported on in section [4] except for the experiments 
explicitly showing the impact of changing the Z 
parameter. 


2.5 Weighted N-Gram Precision 

Analogous to how we improved PM with WPM, 
we seek to improve NGP in a similar fashion. As 
can be seen from the numerator of Equation [5j 
NGP is weighting the match of all n-grams as 
uniformly important. However, it is not the case 
that each n-gram is of equal value to the transla¬ 
tor. Similar to WPM, we use IDF as the basis of 
our proxy for weighting n-grams according to the 
value their translations are expected to provide to 
translators. Specifically, we define the weight of 
an n-gram to be the sum of the IDF values for each 
constituent unigram that comprises the n-gram. 

Accordingly, we formally define method 
Weighted N-Gram Precision (WNGP) as follows: 


N 1 

WNGP = J2 JjWPn, 
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where N is as defined in Equation [4j and 
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where Z, M n - grams , and C n - grams are as defined 
in Equation [5] and 


w(i)= ^2 idf (1-gram, D), (8) 

1 -gramEi 

where i is an n-gram and idf(x , D) is as defined 
above for Equation [2] 

2.6 Modified Weighted N-gram Precision 

Note that in Equation [6] each wp n contributes 
equally to the average. Modified Weighted N- 
Gram Precision (MWNGP) improves on WNGP 
by weighting the contribution of each wp n so that 
shorter n-grams contribute more than longer n- 
grams. The intuition is that for TM settings, get¬ 
ting more high-value shorter n-gram matches at 
the expense of fewer longer n-gram matches will 
be more helpful since translators will get relatively 
more assistance from seeing new high-value vo¬ 
cabulary. Since the translators already presumably 
know the rules of the language in terms of how 
to order words correctly, the loss of the longer n- 
gram matches will be mitigated. 

Formally we define MWNGP as follows: 




MWNGP = 


N 


2 n - l Z-j 2 n 

71=1 


Wpn , 


(9) 


where N and wp n are as they were defined for 
Equation [6] 


3 Experimental Setup 

We performed experiments on two corpora from 
two different technical domains with two language 
pairs, French-English and Chinese-English. Sub¬ 
section 3.1 discusses the specifics of the corpora 


and the processing we performed. Subsection 3.2 
discusses the specifics of our human evaluations of 
how helpful retrieved segments are for translation. 










3.1 Corpora 


For Chinese-English experiments, we used the 


OpenOffice3 (003) parallel coipus (Tiedemann, 


2009), which is 003 computer office productiv¬ 


ity software documentation. For French-English 
experiments, we used the EMEA parallel cor¬ 
pus (Tiedemann, 2009 ), which are medical docu¬ 
ments from the European Medecines Agency. The 
corpora were produced by a suite of automated 
tools as described in ( [Tiedemann, 2009| ) and come 
sentence-aligned. 

The first step in our experiments was to pre- 
process the corpora. For Chinese corpora we to- 
kenize each sentence using the Stanford Chinese 
Word Segmenter (Tseng et al., 2005) with the Chi¬ 
nese Penn Treebank standard ( Xia, 2000] ). For all 
corpora we remove all segments that have fewer 
than 5 tokens or more than 100 tokens. We call 
the resulting set the valid segments. For the pur¬ 
pose of computing match statistics, for French cor¬ 
pora we remove all punctuation, numbers, and sci¬ 
entific symbols; we case-normalize the text and 
stem the corpus using the NLTK French snowball 
stemmer. For the puipose of computing match 
statistics, for Chinese corpora we remove all but 
valid tokens. Valid tokens must include at least 
one Chinese character. A Chinese character is de¬ 
fined as a character in the Unicode range 0x4E00- 
0x9FFF or 0x4000-0x4DFF or 0xF900-0xFAFF. 
The rationale for removing these various tokens 
from consideration for the purpose of comput¬ 
ing match statistics is that translation of numbers 
(when they’re written as Arabic numerals), punc¬ 
tuation, etc. is the same across these languages 
and therefore we don’t want them influencing the 
match computations. But once a translation is se¬ 
lected as being most helpful for translation, the 
original version (that still contains all the numbers, 
punctuation, case markings, etc.) is the version 
that is brought back and displayed to the transla¬ 
tor. 


For the TM simulation experiments, we ran¬ 
domly sampled 400 translations from the 003 
corpus and pretended that the Chinese sides of 
those 400 translations constitute the workload 
Chinese MTBT. From the rest of the coipus we 
randomly sampled 10,000 translations and pre¬ 
tended that that set of 10,000 translations consti¬ 
tutes the Chinese-English TMB. We also did simi¬ 
lar sampling from the EMEA corpus of a workload 
French MTBT of size 300 and a French-English 


TMB of size 10,000. 

After the preprocessing and selection of the 
TMB and MTBT, we found the best-matching 
segment from the TMB for each MTBT seg¬ 
ment according to eac h TM retrieval metric de¬ 
fined in section 2[] The resulting sets of 
(MTBT segment,best-matching TMB segment) 
pairs formed the inputs on which we conducted 
our evaluations of the performance of the various 
TM retrieval metrics. 

3.2 Human Evaluations 

To conduct evaluations of how helpful the transla¬ 
tions retrieved by the various TM retrieval metrics 
would be for translating the MTBT segments, we 
used Amazon Mechanical Turk, which has been 
used productively in the past for related work in 


the context of machine translation (Bloodgood and 
Callison-Burch, 2010b] Bloodgood and Callison- 
Burc h, 201 0a; Callison-Burch, 2009[ ). 

For each (MTBT segment,best-matching TMB 
segment) pair generated as discussed in subsec¬ 
tion ED we collected judgments from Turkers 
(i.e., the workers on MTurk) on how helpful 
the TMB translation would be for translating the 
MTBT segment on a 5-point scale. The 5-point 
scale was as follows: 


• 5 = Extremely helpful. The sample is so sim¬ 
ilar that with trivial modifications I can do the 
translation. 

• 4 = Very helpful. The sample included a large 
amount of useful words or phrases and/or 
some extremely useful words or phrases that 
overlapped with the MTBT. 

• 3 = Helpful. The sample included some use¬ 
ful words or phrases that made translating the 
MTBT easier. 

• 2 = Slightly helpful. The sample contained 
only a small number of useful words or 
phrases to help with translating the MTBT. 

• 1 = Not helpful or detrimental. The sample 
would not be helpful at all or it might even be 
harmful for translating the MTBT. 

After a worker rated a (MTBT segment,TMB 
segment) pair the worker was then required to give 

3 If more than one segment from the TMB was tied for 
being the highest-scoring segment, the segment located first 
in the TMB was considered to be the best-matching segment. 



























metric 

PM 

WPM 

ED 

NGP 

WNGP 

MWNGP 

PM 

100.0 

69.5 

23.0 

32.0 

31.5 

35.5 

WPM 

69.5 

100.0 

25.8 

37.0 

39.0 

44.2 

ED 

23.0 

25.8 

100.0 

41.5 

35.8 

35.0 

NGP 

32.0 

37.0 

41.5 

100.0 

77.8 

67.0 

WNGP 

31.5 

39.0 

35.8 

77.8 

100.0 

81.2 

MWNGP 

35.5 

44.2 

35.0 

67.0 

81.2 

100.0 


Table 1: 003 Chinese-English: The percent of the 
time that each pair of metrics agree on the most 
helpful TM segment 


metric 

PM 

WPM 

ED 

NGP 

WNGP 

MWNGP 

PM 

100.0 

64.7 

30.3 

40.3 

38.3 

41.3 

WPM 

64.7 

100.0 

32.0 

46.3 

47.0 

54.3 

ED 

30.3 

32.0 

100.0 

42.3 

40.3 

39.3 

NGP 

40.3 

46.3 

42.3 

100.0 

76.3 

67.7 

WNGP 

38.3 

47.0 

40.3 

76.3 

100.0 

81.3 

MWNGP 

41.3 

54.3 

39.3 

67.7 

81.3 

100.0 


Table 2: EMEA French-English: The percent of 
the time that each pair of metrics agree on the most 
helpful TM segment 

an explanation for their rating. These explanations 
proved quite helpful as discussed in section [4] For 
each (MTBT segment,TMB segment) pair, we col¬ 
lected judgments from five different Turkers. For 
each (MTBT segment,'TMB segment) pair these 
five judgments were then averaged to form a mean 
opinion score (MOS) on the helpfulness of the re¬ 
trieved TMB translation for translating the MTBT 
segment. These MOS scores form the basis of our 
evaluation of the performance of the different TM 
retrieval metrics. 

4 Results and Analysis 

4.1 Main Results 

Tables [T] and [2] show the percent of the time that 
each pair of metrics agree on the choice of the 
most helpful TM segment for the Chinese-English 
003 data and the French-English EMEA data, re¬ 
spectively. A main observation to be made is that 
the choice of metric makes a big difference in 
the choice of the most helpful TM segment. For 
example, we can see that the surmised industrial 
standard ED metric agrees with the new MWNGP 
metric less than 40% of the time on both sets of 
data (35.0% on Chinese-English 003 and 39.3% 
on French-English EMEA data). 

Tables [3] and |4] show the number of times each 
metric found the TM segment that the Turkers 
judged to be the most helpful out of all the TM 
segments retrieved by all of the different metrics. 
From these tables one can see that the MWNGP 


Metric 

Found Best 

Total MTBT Segments 

PM 

178 

400 

WPM 

200 

400 

ED 

193 

400 

NGP 

251 

400 

WNGP 

271 

400 

MWNGP 

282 

400 


Table 3: 003 Chinese-English: The number of 
times that each metric found the most helpful TM 
segment (possibly tied). 


Metric 

Found Best 

Total MTBT Segments 

PM 

166 

300 

WPM 

184 

300 

ED 

148 

300 

NGP 

188 

300 

WNGP 

198 

300 

MWNGP 

201 

300 


Table 4: EMEA French-English: The number of 
times that each metric found the most helpful TM 
segment (possibly tied). 

method consistently retrieves the best TM segment 
more often than each of the other metrics. Scat- 
terplots showing the exact performance on every 
MTBT segment of the 003 dataset for various 
metrics are shown in Figures [TJ [2j and [3] To con¬ 
serve space, scatterplots are only shown for met¬ 
rics PM (baseline metric), ED (strong surmised 
industrial standard metric), and MWNGP (new 
highest-performing metric). For each MTBT seg¬ 
ment, there is a point in the scatterplot. The y- 
coordinate is the value assigned by the TM metric 
to the segment retrieved from the TM bank and 
the x-coordinate is the MOS of the five Turkers 
on how helpful the retrieved TM segment would 
be for translating the MTBT segment. A point 
is depicted as a dark blue diamond if none of 
the other metrics retrieved a segment with higher 
MOS judgment for that MTBT segment. A point 
is depicted as a yellow circle if another metric re¬ 
trieved a different segment from the TM bank for 
that MTBT segment that had a higher MOS. 

A main observation from Figure [I] is that PM is 
failing as evidenced by the large number of points 
in the upper left quadrant. For those points, the 
metric value is high, indicating that the retrieved 
segment ought to be helpful. However, the MOS 
is low, indicating that the humans are judging it 
to not be helpful. Figure [2] shows that the ED 


metric does not suffer from this problem. How¬ 
ever, Figure [2] shows that ED has another prob¬ 
lem, which is a lot of yellow circles in the lower 
left quadrant. Points in the lower left quadrant are 
not necessarily indicative of a poorly performing 
metric, depending on the degree of match of the 
TMB with the MTBT workload. If there is noth¬ 
ing available in the TMB that would help with 
the MTBT, it is appropriate for the metric to as¬ 
sign a low value and the humans to correspond¬ 
ingly agree that the retrieved sentence is not help¬ 
ful. However, the fact that so many of ED’s points 
are yellow circles indicates that there were better 
segments available in the TMB that ED was not 
able to retrieve yet another metric was able to re¬ 
trieve them. Observing the scatterplots for ED and 
those for MWNGP one can see that both methods 
have the vast majority of points concentrated in 
the lower left and upper right quadrants, solving 
the upper left quadrant problem of PM. However, 
MWNGP has a relatively more densely populated 
upper right quadrant populated with dark blue di¬ 
amonds than ED does whereas ED has a more 
densely populated lower left quadrant with yel¬ 
low circles than MWNGP does. These results and 
trends are consistent across the EMEA French- 
English dataset so those scatterplots are omitted 
to conserve space. 

Examining outliers where MWNGP assigns a 
high metric value yet the Turkers indicated that the 
translation has low helpfulness such as the point 
in Figure [3] at (1.6,0.70) is informative. Looking 
only at the source side, it looks like the translation 
retrieved from the TMB ought to be very help¬ 
ful. The Turkers put in their explanation of their 
scores that the reason they gave low helpfulness 
is because the English translation was incorrect. 
This highlights that a limitation of MWNGP, and 
all other TM metrics we’re aware of, is that they 
only consider the source side. 

4.2 Adjusting for length preferences 

As discussed in section [2} the Z parameter can be 
used to control for length preferences. Table [5] 
shows how the average length, measured by num¬ 
ber of tokens of the source side of the translation 
pairs returned by MWNGP, changes as the Z pa¬ 
rameter is changed. 

Table [6] shows an example of how the opti¬ 
mal translation pair returned by MWNGP changes 
from Z=0.00 to Z=1.00. The example illustrates 



Figure 1: 003 PM scatterplot 



Figure 2: 003 ED scatterplot 



Figure 3: 003 MWNGP scatterplot 





MTBT 

French: Ne pas utiliser durant la gestation et la lactation, car T innocuite du 
medicament veterinaire n’ a pas ete etablie pendant la gestation ou 
la lactation. 

English: Do not use during pregnancy and lactation because the safety of the 
veterinary medicinal product has not been established during 
pregnancy and lactation. 

MWNGP 

(Z=0.00) 

French: Peut etre utilise pendant la gestation et la lactation. 

English: Can be used during pregnancy and lactation. 

MWNGP 

(Z=1.00) 

French: Ne pas utiliser chez T animal en gestation ou en periode de lactation, 
car la securite du robenacoxib n’ a pas ete etablie chez les femelles gestantes ou 
allaitantes ni chez les chats et chiens utilises pour la reproduction. 

English: Do not use in pregnant or lactating animals because the safety of 
robenacoxib has not been established during pregnancy and lactation or in cats 
and dogs used for breeding. 


Table 6: This table shows for an example MTBT workload sentence from the EMEA French-English data 
how the optimal translation pair returned by MWNGP changes when going from Z = 0.00 to Z = 1.00. 
We provide the English translation of the MTBT workload sentence for the convenience of the reader 
since it was available from the EMEA parallel corpus. Note that in a real setting it would be the job of 
the translator to produce the English translation of the MTBT-French sentence using the translation pahs 
returned by MWNGP as help. 


Z Value 

Avg Length 

Z Value 

Avg Length 

0.00 

9.9298 

0.00 

7.2475 

0.25 

13.204 

0.25 

9.5600 

0.50 

16.0134 

0.50 

11.1250 

0.75 

19.6355 

0.75 

14.1825 

1.00 

27.8829 

1.00 

25.0875 

(a) EMEA French-English 

(b) 003 Chinese-English 


Table 5: Average TM segment length, measured 
by number of tokens of the source side of the trans¬ 
lation pairs returned by MWNGP for varying val¬ 
ues of the Z parameter 

the impact of changing the Z value on the na¬ 
ture of the translation matches that get returned 
by MWNGP. As discussed in section [2j smaller 
settings of Z are appropriate for preferences for 
shorter matches that are more precise in the sense 
that a larger percentage of their content will be 
relevant. Larger settings of Z are appropriate for 
preferences for longer matches that have higher re¬ 
call in the sense that they will have more matches 
with the content in the MTBT segment overall, al¬ 
though at the possible expense of having more ir¬ 
relevant content as well. 

5 Conclusions 

Translation memory is one of the most widely 
used translation technologies. One of the most 
important aspects of the technology is the system 


for assessing candidate translations from the TM 
bank for retrieval. Although detailed descriptions 
of the apparatus used in commercial systems are 
lacking, it is widely believed that they are based 
on an edit distance approach. We have defined 
and examined several TM retrieval approaches, in¬ 
cluding a new method using modified weighted n- 
gram precision that performs better than edit dis¬ 
tance according to human translator judgments of 
helpfulness. The MWNGP method is based on the 
following premises: local context matching is de¬ 
sired; weighting words and phrases by expected 
helpfulness to translators is desired; and allowing 
shorter n-gram precisions to contribute more to the 
final score than longer n-gram precisions is de¬ 
sired. An advantage of the method is that it can be 
adjusted to suit translator length preferences of re¬ 
turned matches. A limitation of MWNGP, and all 
other TM metrics we are aware of, is that they only 
consider the source language side. Examples from 
our experiments reveal that this can lead to poor 
retrievals. Therefore, future work is called for to 
examine the extent to which the target language 
sides of the translations in the TM bank influence 
TM system performance and to investigate ways 
to incorporate target language side information to 
improve TM system performance. 
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